[AURON #2236] Support compile with Celeborn 0.6 and spark 4.0 #2243

Merged
cxzl25 merged 3 commits into apache:master from ftong2020:celeborn-0-6-with-spark-4-0 on May 9, 2026

Conversation

@ftong2020
Contributor

Which issue does this PR close?

Closes #2236

Rationale for this change

Spark 4.0 has been released for over a year, and Celeborn 0.6 provides official Spark 4.0 support.

This change adds Spark 4.0 support when Auron is compiled with Spark 4.0 and Celeborn 0.6.

Spark 4.1 support is not included in this PR, since the latest Celeborn release does not support that configuration. Spark 4.1 also changed the signature of MapStatus.apply(), which would demand heavy changes to the codebase.

What changes are included in this PR?

Are there any user-facing changes?

No.

How was this patch tested?

Tested in our staging environment.

Contributor

Copilot AI left a comment


Pull request overview

Adds Spark 4.0 compilation support when building Auron with Celeborn 0.6 by updating Spark-version-gated APIs and aligning Spark 4’s shuffle-write execution behavior with existing Spark 3.x assumptions.

Changes:

  • Extend @sparkver gating for getPartitionLengths() to include Spark 4.0 in Celeborn 0.6 and RSS shuffle writer implementations.
  • Update NativeRDD.compute() to also defer execution for RSS_SHUFFLE_WRITER plans on Spark 4+ (to avoid early execution under the refined ShuffleWriteProcessor flow).
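
The two changes above could look roughly like the following Scala sketch. This is a hypothetical reconstruction: the exact @sparkver version list, field names, and plan-kind checks are assumptions, not the actual Auron code.

```scala
// Hypothetical sketch of the two changes, not the actual Auron source.

// 1. Version-gated override: Spark 4.0 is added to the @sparkver gate so the
//    same getPartitionLengths() override also compiles under the Spark 4.0
//    profile (the concrete version list here is an assumption).
@sparkver("3.2 / 3.3 / 3.4 / 3.5 / 4.0")
override def getPartitionLengths(): Array[Long] = partitionLengths

// 2. NativeRDD.compute(): the Spark 4+ deferral is widened so that
//    RSS_SHUFFLE_WRITER plans, like SHUFFLE_WRITER plans, are not executed
//    early under the refined ShuffleWriteProcessor flow (names assumed).
val deferNativeExecution =
  isSparkVersionAtLeast40 &&
    (planKind == PlanKind.SHUFFLE_WRITER || planKind == PlanKind.RSS_SHUFFLE_WRITER)
```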

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Changed files:

  • thirdparty/auron-celeborn-0.6/src/main/scala/org/apache/spark/sql/execution/auron/shuffle/celeborn/AuronCelebornShuffleWriter.scala: adds Spark 4.0 to the version-gated getPartitionLengths() override for the Celeborn 0.6 integration.
  • spark-extension/src/main/scala/org/apache/spark/sql/execution/auron/shuffle/AuronRssShuffleWriterBase.scala: adds Spark 4.0 to the version-gated getPartitionLengths() override for the RSS shuffle writer base.
  • spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeRDD.scala: extends the Spark 4+ shuffle-write deferral logic to also cover RSS shuffle writer plans.


@yew1eb
Contributor

yew1eb commented May 7, 2026

@ftong2020 Thanks for the contribution! Please update the PR title to include the issue ID: [AURON #2236]

ftong2020 changed the title from "support compile with Celeborn 0.6 and spark 4.0" to "[AURON #2236]support compile with Celeborn 0.6 and spark 4.0" on May 8, 2026
@ftong2020
Contributor Author

@ftong2020 Thanks for the contribution! Please update the PR title to include the issue ID: [AURON #2236]

Sure.

cxzl25 changed the title from "[AURON #2236]support compile with Celeborn 0.6 and spark 4.0" to "[AURON #2236] Support compile with Celeborn 0.6 and spark 4.0" on May 8, 2026
@cxzl25
Contributor

cxzl25 commented May 8, 2026

Spark 4.1 support is not included in this PR, since the latest Celeborn release does not support that configuration. Spark 4.1 also changed the signature of MapStatus.apply(), which would demand heavy changes to the codebase.

The newly added parameter of MapStatus.apply has a default value. Does this have any impact on supporting Spark 4.1?

  def apply(
      loc: BlockManagerId,
      uncompressedSizes: Array[Long],
      mapTaskId: Long,
      checksumVal: Long = 0): MapStatus = {

@ftong2020
Contributor Author

Spark 4.1 support is not included in this PR, since the latest Celeborn release does not support that configuration. Spark 4.1 also changed the signature of MapStatus.apply(), which would demand heavy changes to the codebase.

The newly added parameter of MapStatus.apply has a default value. Does this have any impact on supporting Spark 4.1?

  def apply(
      loc: BlockManagerId,
      uncompressedSizes: Array[Long],
      mapTaskId: Long,
      checksumVal: Long = 0): MapStatus = {

I tried it; it is the Celeborn client itself that does not support Spark 4.1. Spark throws the following exception whether or not Auron is enabled:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (4a802e9ef77a executor driver): java.lang.NoSuchMethodError: 'org.apache.spark.scheduler.MapStatus org.apache.spark.scheduler.MapStatus$.apply(org.apache.spark.storage.BlockManagerId, long[], long)'
at org.apache.spark.shuffle.celeborn.SparkUtils.createMapStatus(SparkUtils.java:87)
at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.close(HashBasedShuffleWriter.java:385)
at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.write(HashBasedShuffleWriter.java:176)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:180)
at org.apache.spark.scheduler.Task.run(Task.scala:147)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:716)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:86)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:83)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:97)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:719)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$3(DAGScheduler.scala:3122)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3122)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3114)
at scala.collection.immutable.List.foreach(List.scala:323)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3114)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1303)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1303)
at scala.Option.foreach(Option.scala:437)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1303)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3397)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3328)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3317)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:50)
Caused by: java.lang.NoSuchMethodError: 'org.apache.spark.scheduler.MapStatus org.apache.spark.scheduler.MapStatus$.apply(org.apache.spark.storage.BlockManagerId, long[], long)'
at org.apache.spark.shuffle.celeborn.SparkUtils.createMapStatus(SparkUtils.java:87)
at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.close(HashBasedShuffleWriter.java:385)
at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.write(HashBasedShuffleWriter.java:176)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:180)
at org.apache.spark.scheduler.Task.run(Task.scala:147)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:716)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:86)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:83)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:97)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:719)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
26/05/08 14:56:25 WARN ShuffleClientImpl: Shuffle client has been shutdown!
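
The NoSuchMethodError above is a binary-compatibility failure: adding a parameter to MapStatus.apply, even one with a default value, changes the JVM method descriptor, so Celeborn's SparkUtils.java, compiled against the old three-argument signature, cannot link at runtime. A minimal Scala illustration (simplified stand-in types, not Spark or Celeborn code):

```scala
// Simplified stand-in, not Spark/Celeborn code: why a Scala default parameter
// does not preserve binary compatibility for already-compiled callers.
object MapStatusLike {
  // Spark 4.0's shape was roughly apply(loc, uncompressedSizes, mapTaskId).
  // Spark 4.1 adds checksumVal with a default. scalac now emits only the
  // four-argument descriptor plus a synthetic apply$default$4() method;
  // the old three-argument descriptor disappears from the bytecode.
  def apply(
      loc: String,
      uncompressedSizes: Array[Long],
      mapTaskId: Long,
      checksumVal: Long = 0): Long = mapTaskId + checksumVal
}

// A Scala caller recompiled against 4.1 is fine (the compiler fills in the
// default at the call site), but a pre-compiled Java caller that invokes the
// old three-argument descriptor fails with java.lang.NoSuchMethodError at
// runtime, exactly as in the stack trace above.
```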

@cxzl25
Contributor

cxzl25 commented May 9, 2026

I tried it; it is the Celeborn client itself that does not support Spark 4.1. Spark throws the following exception whether or not Auron is enabled.

Sorry, I forgot: Celeborn 0.7.0 needs to be released to support Spark 4.1.

https://issues.apache.org/jira/browse/CELEBORN-2239

@cxzl25 cxzl25 merged commit 6065c55 into apache:master May 9, 2026
123 of 125 checks passed


Successfully merging this pull request may close these issues.

Can not compile auron targeting scalar 2.13 and spark 4 when celeborn is enabled

4 participants